ABSTRACT 

Evidence has shown that students’ attention is a crucial
factor for engagement and learning gain. Although it can be
accurately assessed ad hoc by an experienced teacher,
continuous contact with all students in a large class is
difficult to maintain and requires training for novice
practitioners. We continue our previous work on
investigating unobtrusive measures of body language in order
to predict students’ attention during the class, and provide
teachers with a support system to help them “scale up” to a
large class.

Our work here is focused on head-motion, by which we aim 
to mimic large-scale gaze tracking. By using new computer 
vision techniques we are able to extract head poses of all 
students in the video-stream from the class. After defining 
several measures about head motion, we checked their signif- 
icance and attempted to demonstrate their value by fitting a 
mixture model and training support vector machines (SVM) 
classifiers. We show that drops in attention are reflected in 
a decreased intensity of head movement. We were also able 
to reach 61.86% correct classifications of student attention 
on a 3-point scale. 
1. INTRODUCTION 

One of the early studies of attention in classrooms showed
that only 46% of students pay attention during the class [4].
Later studies raised that estimate to a more optimistic
but still insufficient 67% [20]. This means that in practice
teachers are lecturing to half-empty classrooms, even if all
chairs are occupied. How can we help teachers learn to
recognize which chairs are empty?


Processing of social cues comes naturally in human-to-human
communication, but it remains the subject of much research
and few technical applications. The ambiguity of the medium
limits our attempts, but in scenarios where body language
becomes the dominant form of expression, we are inclined to
dig further into the matter. One such scenario is the
classroom. We argue that computer vision (CV) technologies,
in combination with machine learning approaches, give us
tools to scale up the teacher’s attention to every student
in the classroom, regardless of the class size. This would
give teachers a timely opportunity to address less attentive
areas of the class and draw students into the lecture,
encouraging the teacher’s reflection in action.

Behaviour of people in large groups is unpredictable to an
observer in most situations. The overwhelming amount of
information forces us to focus on a few individuals whom we
deem representative of the group, and mental effort and
training are required to redistribute attention equally
among many subjects [7]. In the case of a lecture, teachers
are active participants, splitting their attention between
personal actions, material presentation and orchestration of
the whole process [8].

In this work we started from the success of eye-tracking in
predicting focus and tried to generalize it to students’ head
movement in the classroom. Birmingham et al. [3] illustrate
the social aspect of gaze: given an image, people first
analyse the gaze, then the head and finally the posture of
the people in the image to collect information about where
to focus their attention. Langton [13] showed that we
combine the input from head and eyes into a single stimulus.
These two observations together gave us grounds to consider
head orientation as i) informative to other humans, and thus
potentially also to our algorithms; ii) an approximation of
human gaze on larger scales of motion.

In this paper we present our process for extracting head
motion and pose features from videos of a classroom
audience, and our initial analyses of the features’ quality.
We try to answer the following questions. Is there a general
connection between head motion and attention level? What are
the features of head motion that we can use in predicting
attention? How do these features change with attention
levels? And finally, can we use these features to predict
students’ attention levels?
2. RELATED WORK 

The umbrella of affective computing [15] has been growing
in the last 15 years, expanding the domains of its
application. The emerging sub-field of Social Signal
Processing (SSP) [24, 25] made a major point of emphasizing
that encoding human social and cultural information might
raise the performance of machine algorithms aimed at
understanding behaviour (e.g. analysing large sports
gatherings [6]).

Human attention is credited with the ability to modulate or
enhance the selected information source according to the
state and goals of the perceiver, so that the “perceiver
becomes an active seeker and processor of information, able
to intelligently interact with their environment” [5], and
it can be highly relevant in a learning environment [14].
Roda et al. [19] already tried to incorporate an attention
indication as one of the inputs in human-computer
interaction, but early attempts in the classroom were not
formulated as a technology that can become widespread, due
to their complexity [1].

Detecting and displaying the gaze direction, as one of the
key indicators of focus of attention, was shown to be both
useful in making the interaction feel more natural [23] and
indicative of material comprehension [21] in on-line
environments. Lacking the possibility of capturing gaze in a
real-life scenario, Ba et al. [2] demonstrated that we can
successfully estimate the VFOA (visual focus of attention)
in meetings based on the head pose. In a similar scenario,
Stiefelhagen et al. [22] showed that head orientation
contributes 68.9% to the overall gaze direction (where the
attention is directed) and achieved 88.7% accuracy at
determining the focus of attention. This indicates that head
motion has potential as a focus indicator, but it does not
come without problems. Deeper exploration of head motion
depicts it as an ambiguous indicator. Heylen’s overview [10]
shows that head signals are either very context-dependent or
are a complementary signal to the main information channel
(usually talking).

Our conclusion from the literature overview is that head
motion has potential as a low-resolution measurement which
we can passively acquire to determine the attention level
and/or direction of another person. To fully decode it we
would need contextual information, which is unavailable in
our approach of passive, unobtrusive data collection [16].
The features we hope to find therefore need to sit midway
between the purely measurable and the context-dependent.

3. METHOD 

Training and validation of our head detector/pose estima- 
tor pipeline was detailed in our previous work [17]. We will 
give a quick overview of the experiment setup and detection 
pipeline, and focus on the steps and problems we encoun- 
tered in the later stages of data extraction. 

3.1 Experiment design 

We collected a total of 6 recorded sessions with 2 classes
(demographic information shown in Table 1). Each classroom
was observed with several cameras positioned above the
teacher’s head around the blackboard area of the classroom
(camera view of the classroom is shown in Figure 1). The
cameras were synchronized and each student visible in the
video was annotated with a unique ID (maintained over all
recorded sessions) and a rectangular area of the video which
the student occupies. Given that the angle of the detected
face is relative to the camera viewpoint, we introduced
angle offsets for each student. If a student was visible
from several cameras, the best quality recording was used.

Figure 1: Examples of gaze detections, showing the
classroom during the lecture.


Class  Size  F. ratio  Mean attend.      Sess  Cams
1      62    35.48%    39.34 (σ = 1.15)  3     5
2      43    34.88%    27.5 (σ = 6.55)   3     4

Table 1: Statistics of the two captured classes, showing
the number of students, percentage of female students,
attendance, number of sessions recorded and number of
cameras used.

Similar to the attention probing used in earlier experiments
[4], we asked students to fill out a questionnaire about
their attention during the class. At four different times
the classes were interrupted and students recorded their
attention on a Likert scale from 1 to 10 (details of the
questionnaire design are presented in [17]). The
distribution of all collected answers is shown in Figure 2.
From each of the 6 processed sessions we recorded 4
measurements of attention per student, each associated with
the 7-10 minute period preceding our interruption. In order
to turn the problem into a classification one, we labelled
the values of the students’ responses as low (reported
attention 1-4), medium (5-7) or high attention (8-10), based
on our observations of the attention distribution (regions
marked in Figure 2).
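The mapping from questionnaire scores to the three classes can be written down directly from the thresholds above. The following is a minimal sketch; the function name `label_attention` is ours and only illustrates the binning.

```python
def label_attention(score: int) -> str:
    """Map a 1-10 self-reported attention score to low/medium/high.

    Thresholds follow the text: 1-4 low, 5-7 medium, 8-10 high.
    """
    if not 1 <= score <= 10:
        raise ValueError(f"score must be in 1..10, got {score}")
    if score <= 4:
        return "low"
    if score <= 7:
        return "medium"
    return "high"
```

For example, a reported attention of 5 falls into the medium class, while 8 is already labelled high.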

3.2 Video analysis 

The head detection and pose estimation were built on
top of the part-based model for head detection published
by Zhu et al. [26], which was re-trained for lower resolution
images and different head poses on the AFLW dataset [12].
We trained a geometrical head-pose estimator (focusing on
the horizontal angle or “pan” of the head) by using the dlib
library [11]. The precision of the estimators was checked on
the Pointing’04 dataset [9]. Each detection consists of the
assumed rectangle of the face area, the estimated angle of
the face (“pan”) and a score (detector confidence).

The major problem in reaching meaningful measurements was
the instability of the detector/estimator output. The
measurements were very noisy since the feature extraction
step was not formulated as a tracker, which would provide
temporal consistency. The second problem came from the setup
itself: given the location of the cameras (around the
blackboard, visible in Figure 1), the subjects sit closely
together. This causes a considerable amount of i)
inter-personal occlusions, ii) gaps in detection and iii)
mis-assignment of detection instances (visualized in Figure
3a).

Figure 2: Histogram of all reported levels of attention
with the limits used to designate the low (red zone, <5),
medium (yellow, 5-7) and high (green, 8-10) levels of
attention.

Simple attempts to pick the best-scoring detection within
the region did not yield a stable output, given that on most
occasions the head of the neighbouring student would wander
into the region and take over as the best detection. Fitting
prior distributions (2D Gaussians) for expected head
locations also did not improve the assignment, as a single
student usually creates 2 or 3 clusters of points (depending
on their sitting poses), which is indistinguishable from the
case when two people occupy the given space.

Finally we settled on a formulation with a labelled GMM
(Gaussian Mixture Model). By taking sparsely sampled
detections over time (one frame every 2 seconds) and
accumulating all the detections, we modelled the overall
probability of detecting faces in different positions of the
camera view. The “labelled” part consists of manually
specifying the relevance of each mixture component, by
labelling the component either as a specific person or as a
misdetection. With this we could filter out all the
irrelevant detections for a specific person by only
considering detections which were assigned to one of the
person-related clusters in the GMM (Figure 3b).
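The labelled-GMM filtering can be sketched as follows. This is an illustration under our own assumptions, not the pipeline used in the study: it relies on scikit-learn's `GaussianMixture`, the number of components is chosen by hand, and the `component_labels` dictionary stands in for the manual labelling step (components missing from the dictionary are treated as misdetections).

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_detection_gmm(points, n_components, seed=0):
    """Fit a 2D GMM to accumulated (x, y) face-detection centres."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full",
                          random_state=seed)
    gmm.fit(np.asarray(points, dtype=float))
    return gmm


def filter_for_person(gmm, points, component_labels, person_id):
    """Keep only detections assigned to components labelled as person_id.

    component_labels maps a component index to a person ID; it is the
    hypothetical result of the manual labelling described in the text.
    """
    points = np.asarray(points, dtype=float)
    assigned = gmm.predict(points)
    keep = np.array([component_labels.get(int(c)) == person_id
                     for c in assigned], dtype=bool)
    return points[keep]
```

Because one person can produce several components (for instance one per sitting pose), several component indices may map to the same person ID in `component_labels`.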

To improve the precision of the GMM fits, before training
the model we eliminated outlier points by thresholding the
minimal number of neighbours a point needs to have in order
to be considered further. This is possible because people
remain in distinct positions for long periods of time,
causing dense groupings of detections. The threshold was
determined dynamically for each video, by eliminating the
0.5% of points with the lowest number of neighbours. The
major role of the GMM filtering step was to eliminate false
positives, as the clusters could not always be mapped
one-to-one to an individual. Additional constraints during
the GMM training phase could solve this problem.
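A minimal sketch of the neighbour-count outlier removal is given below. The neighbourhood `radius` is an assumption of ours (the text does not state how neighbours were counted); only the 0.5% drop fraction comes from the description above. The quadratic loop is kept for clarity and would need a spatial index for large point sets.

```python
import math


def drop_sparse_points(points, radius, drop_fraction=0.005):
    """Remove the drop_fraction of points with the fewest neighbours.

    points: list of (x, y) detection centres.
    radius: neighbourhood radius in pixels (hypothetical parameter).
    """
    n = len(points)
    counts = []
    for i, p in enumerate(points):
        # Count the other points that lie within `radius` of p.
        counts.append(sum(1 for j, q in enumerate(points)
                          if j != i and math.dist(p, q) <= radius))
    n_drop = int(n * drop_fraction)
    if n_drop == 0:
        return list(points)
    # Indices of the n_drop most isolated points.
    sparsest = set(sorted(range(n), key=lambda i: counts[i])[:n_drop])
    return [p for i, p in enumerate(points) if i not in sparsest]
```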

After filtering out the misdetections, temporal consistency
was ensured by using a simplified Kalman filter approach:
the next detection is expected to be in close proximity to
the previous detection. If no detections are observed within
a specified radius of the previous detection, no detection
is reported and the radius is increased for the next
processed frame, simulating the increase in uncertainty. The
major differences from the Kalman filter are the absence of
a motion model (the face is expected to remain in the same
place) and the lack of probability propagation. This enabled
us to use only real detections and not estimates, which is
relevant for modelling heads in a bowed-down position.
Region growing was preferred over a moving Gaussian in order
to put a hard limit on the detections which can be
considered.
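The region-growing step described above might look like the sketch below. All numeric defaults (`base_radius`, `growth`, `max_radius`) are placeholders, and choosing the nearest candidate inside the radius as well as capping the radius are our assumptions; the text only specifies that the radius grows after a miss and that no estimate is reported in that case.

```python
import math


def track_person(frames, start, base_radius=20.0, growth=10.0,
                 max_radius=80.0):
    """Follow one person through per-frame detections.

    frames: list of lists of (x, y) detection centres, one list per frame.
    start:  expected initial head position (x, y).
    Returns one entry per frame: the accepted detection, or None.
    """
    last = start
    radius = base_radius
    track = []
    for detections in frames:
        candidates = [d for d in detections if math.dist(d, last) <= radius]
        if candidates:
            # Assumption: take the candidate closest to the last position.
            best = min(candidates, key=lambda d: math.dist(d, last))
            track.append(best)
            last = best
            radius = base_radius          # certainty restored
        else:
            track.append(None)            # report nothing, never an estimate
            radius = min(radius + growth, max_radius)
    return track
```

Frames with `None` are what later feed the “percentage of time detected” feature, which is why real detections are kept separate from gaps.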

To make sure that a detection would not be used twice, we
removed each detection once it had been assigned to a
person. This turns the algorithm into a greedy approach and
makes the order in which the persons are processed
important. We chose to process the persons from front to
back, given that a person sitting closer to the cameras is
more likely to be correctly detected. After extracting
detection tracks for each person, the values of the
detection rectangle position and gaze angle are smoothed
with a “sliding window” approach.
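The text does not specify the window size or the statistic used for smoothing, so the following is only one plausible reading: a centred moving average that leaves gaps in the track untouched. The `window` default is a hypothetical value.

```python
def smooth(values, window=5):
    """Centred moving average over a track, skipping missing samples.

    values: per-frame measurements (e.g. pan angle), None where the
            person was not detected. Gaps stay None in the output.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i, v in enumerate(values):
        if v is None:
            smoothed.append(None)
            continue
        neighbours = [x for x in values[max(0, i - half):i + half + 1]
                      if x is not None]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed
```

A median over the same window would be an equally reasonable choice if the detector produces occasional large jumps.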

3.3 Features extracted 

The input features used in our predictions were largely based
on the information extracted from the cameras, but not
exclusively. All features used are shown in Table 2. As we
noted before, time and spatial arrangement also play a
significant role in attention estimation [18], so we
included information about the distance of the student
from the teacher (distance and row fields) and the time of
the sample within the class (period).

We tried to model eye contact in the class with the
percentage of time that we detected the student’s face in
the video. The initial assumption is that this would allow
us to measure the time the student spent looking down just
by noting how long the head was absent. The noise in the
measurement originates from the false negatives of the
detector, which are predominantly influenced by the distance
from the camera. Even though we resorted to using zoom
lenses for the distant people in the class (which makes the
measurements comparable, even at the capture level, to those
of the people in the front rows), there was still a
significant correlation between the row in which the student
sat and the percentage of time detected (r = -0.1867,
p = 0.009), although it was weaker than the correlation with
the Cartesian distance from the teacher (r = -0.2137,
p = 0.002), which encodes the width as well as the depth of
the classroom.

“Head travel” records the total accumulated head travel in
the horizontal plane. We ignored the potential head travel
in the periods when we did not detect the face of the
student. In order to neutralize the potential influences of
a person’s rhythm and distance from the camera, we also
included a normalized version of the measure: we used all
the measurements of a single person to determine the mean,
and scaled the centred values with the variance of those
measurements. Samples from persons with a single measurement
were excluded.
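The two head-travel measures can be sketched as below. Note one deliberate deviation that should be read as our assumption: the sketch divides by the sample standard deviation (a conventional z-score), whereas the text speaks of scaling with the variance; swap `stdev` for `variance` if the latter is meant literally.

```python
from statistics import mean, stdev


def head_travel(pan_angles):
    """Sum of absolute frame-to-frame pan changes, in degrees.

    pan_angles: per-frame pan angle, None where the face was not
    detected. Changes across a gap are ignored, as in the text.
    """
    total = 0.0
    previous = None
    for angle in pan_angles:
        if angle is not None and previous is not None:
            total += abs(angle - previous)
        previous = angle
    return total


def normalize_per_person(travels):
    """Normalize one person's head-travel values over all their periods.

    Returns None for persons with fewer than two measurements, since
    those are excluded from the normalized feature.
    """
    values = list(travels)
    if len(values) < 2:
        return None
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]
```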

We modelled the focus of the student with 3 connected
measures of stillness: the number of still periods, the mean
duration of a still period and the percentage of time spent
still. Stillness was defined as periods during which the
head changes are less than 10°, and where the head’s angle
does not move away from the initial angle by more than 10°
(in order to prevent slow drifting from being classified as
stillness). “Stillness periods” are defined as
non-overlapping periods with a minimum duration of 5
seconds in which the stillness condition is true. From there
we get the first two measures by counting the number of such
periods and taking their mean duration. The percentage of
time spent still is the ratio of the time classified as
still to the duration of the attention period.
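One way to turn the stillness definition into code is the greedy scan below. It is a sketch under assumptions: the track is taken to be gap-free, "head changes" is read as the frame-to-frame difference, and a period is extended for as long as both 10° conditions hold. The function and parameter names are ours.

```python
def still_periods(times, angles, max_step=10.0, max_drift=10.0,
                  min_duration=5.0):
    """Return non-overlapping (start_time, end_time) still periods.

    times:  timestamps in seconds, increasing.
    angles: pan angle in degrees for each timestamp (no gaps assumed).
    """
    periods = []
    i, n = 0, len(angles)
    while i < n:
        j = i
        # Extend while consecutive change and drift from the initial
        # angle both stay within their limits.
        while (j + 1 < n
               and abs(angles[j + 1] - angles[j]) < max_step
               and abs(angles[j + 1] - angles[i]) <= max_drift):
            j += 1
        if times[j] - times[i] >= min_duration:
            periods.append((times[i], times[j]))
            i = j + 1                     # periods must not overlap
        else:
            i += 1
    return periods


def stillness_features(times, angles, period_duration):
    """Number of still periods, their mean duration, percentage still."""
    durations = [end - start for start, end in still_periods(times, angles)]
    count = len(durations)
    mean_duration = sum(durations) / count if count else 0.0
    percent_still = 100.0 * sum(durations) / period_duration
    return count, mean_duration, percent_still
```

The drift condition is what rejects a head that turns slowly by 3° per second: every single step is small, yet no 5-second stretch stays within 10° of its starting angle.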

All measurements were considered per attention period and
per person in order to associate the features with the
labels acquired from the questionnaire. In the case of
regression and correlation tests, we also tested the
correlation of the measures after a logit transformation, by
first bounding the value ranges (finding the minimum and
maximum values for all measurements and scaling them to the
0.1 - 0.9 interval) and then applying
$\log_e\left(\frac{x}{1-x}\right)$.
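The bounded logit transformation amounts to a min-max rescaling followed by the log-odds. A minimal sketch, with the function name `bounded_logit` chosen by us:

```python
import math


def bounded_logit(values, lo=0.1, hi=0.9):
    """Rescale values to [lo, hi] and apply the logit log(x / (1 - x))."""
    vmin, vmax = min(values), max(values)
    if vmin == vmax:
        raise ValueError("values must not all be equal")
    transformed = []
    for v in values:
        x = lo + (hi - lo) * (v - vmin) / (vmax - vmin)
        transformed.append(math.log(x / (1.0 - x)))
    return transformed
```

Scaling to 0.1 - 0.9 rather than 0 - 1 keeps the logit finite at the extremes: the smallest and largest measurements map to about -2.2 and +2.2.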

4. RESULTS AND DISCUSSION 

4.1 Features 

First significance tests showed a correlation between the
raw attention level and the percentage of time the person
was detected (Pearson’s r = 0.1158, p = 0.01, 577 samples).
This can be explained with the idea that engaged students
will maintain more contact with the activities in the
classroom. Apart from the students being more visible,
students’ head travel did not show a significant difference
on the overall scale. We expected this, as the measure can
easily be affected by noisy detections, even though we did
take steps to smooth the data.

Head travel became significant when testing its potential
to measure the change in behaviour. After eliminating the
individual differences by normalizing head travel, we
found that positive changes in attention were reflected in
an increase in head travel (Pearson’s r = 0.21, p < 0.01,
236 samples), as shown in Figure 4.

Of the measures of stillness, only “percentage of time spent
still” recorded a significant but very weak correlation
(Pearson’s r = 0.09, p = 0.02). After comparing it with the
“percentage of time detected” we found a very high and
significant correlation between the two measures (r = 0.91,
p < 0.01), which suggests that the measure carries little
information of its own. We kept the measures for further
testing.

4.2 Models 

The next step in demonstrating the usefulness of the features
was to try to predict the attention levels based on their
combinations. After initial attempts with linear regression,
which were not successful, we switched to a mixed model.
Our mixed model for logit attention (A) with period (P), row
(R), number of still periods (N) and normalized head travel
(H) takes the form

$L(A) = 1.061 - 0.060P - 0.128R + 0.012N - 0.035H.$

Although its predictive power ($R^2_{random} = 0.54$ and
$R^2_{fixed} = 0.05$) is limited, its significance encourages
further investigation of more advanced supervised learning
methods.

Figure 4: Change in normalized head travel correlated to the
change in attention (axis label: delta attention). The red
line represents the linear fit. Pearson’s r = 0.21,
p < 0.01. Number of samples 236. Noise added for the
visualization after the linear fit.

With that in mind, we tried an exhaustive search of all
feature combinations and SVM parameters to achieve the best
prediction of the three categories of “labelled attention”:
low (100 samples), medium (270 samples) and high (246
samples). Training of the classifiers was repeated in
several rounds (500 iterations) with random drawing of
training and testing samples, while making sure that the
ratio of samples for each output category was maintained
(roughly 16%, 44% and 40%). Our training procedure was based
on an 80-20 split: 80% of the data was used for training,
and 20% for testing the prediction of the trained
classifier. To evaluate SVM parameters during the training
we additionally split the 80% used for training with another
80-20 split. This gives us the final data configuration, a
64-16-20 split, where 64% of the data was used for training,
16% for evaluating the SVM parameters during the training
and 20% for the final evaluation of the trained classifier.
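The stratified 64-16-20 split can be illustrated with the standard library alone. This sketch is not the code used in the study; it draws the three subsets in one pass per class, which yields the same proportions as the nested 80-20 splits described above, and the `seed` argument stands in for one of the 500 random rounds.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.64, 0.16, 0.20), seed=0):
    """Split sample indices into train/validation/test per class.

    labels: class label of every sample ("low", "medium", "high").
    Returns three lists of indices with the class ratio preserved.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, validation, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        validation += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder, about 20%
    return train, validation, test
```

With the class sizes reported above (100, 270 and 246 samples), the low class contributes 64, 16 and 20 samples to the three subsets.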

For each combination of features we iterated over the SVM
parameters with a sampling step of 0.1 (the kernel types
considered were linear, polynomial and RBF, with their
relevant parameters). On the top scoring feature
combinations we applied gradual refinement of the parameter
sampling step (the step size was reduced in the sequence
0.1, 0.01, 0.001 around the best scoring parameter values
from the previous round). The four best scoring classifiers
are given in Table 3, with the best result of 61.86% correct
classifications (Cohen’s kappa 0.30) on the independent test
set.

Our concern was that the main informative source would
be the Detection percentage or Percentage still, the two
being highly correlated. This did happen in the early
training attempts, but these features are not represented in
the final set of classifiers (Detection percentage is used
in the 10th best classifier). All of the best classifiers
included a similar mix of features: head motion
representatives, and some indication of distance and time of
the class. Normalized head travel and Mean duration of still
periods appear to be the most salient features (both used in
3 of the 4 classifiers).

Even though we saw no significant correlation of attention
with class period in the feature analysis, we also tested
“attention labelled” for the Markov property and got the
highly informative transition probabilities shown in Figure
5. The trend of remaining in the same state, with lower
probabilities of transition to a neighbouring state,
although not directly relevant to the attention level,
definitely puts additional constraints on the predictions.
In order to integrate this knowledge into our model, the
next step was to connect our SVM predictions (observational
model) and temporal consistency (transition probabilities)
into a Hidden Markov Model, but due to time constraints we
are unable to report the results in this publication.

Figure 5: Transition probabilities between the three
attention levels (low, medium, high).

5. CONCLUSION

The goal of this study was not only to answer questions
about the link between students’ movement and attention,
but also to investigate to what extent we can approximate
these variables with current techniques, without manual
annotation. We defined a number of head metrics that can
be extracted from a video of the audience attending a class.
Considering measures that are “global” in nature (not
relying on specific events such as gesturing, nodding etc.)
we have shown that the change in head motion correlates with
the change in the reported level of attention. We also
experimentally confirmed that a higher percentage of head
detection mirrors more time spent in contact with the
classroom events, indicating higher attentiveness.


For classification tasks, we found that head measurements
alone were not enough to give us definitive answers about
the person’s attention. Each of the high-scoring classifiers
used other contextual cues which related the person’s
actions to the temporal or spatial domain (e.g. class
period, distance). Also, in this report we did not explore
social-level cues, that is, how the students’ actions
contrast with their immediate environment or the general
classroom population. We expect that these features will
provide further contextual information, which will raise the
precision of predictions.

Apart from the “global” measurements, we are also look- 
ing to explore discrete gestures which can be detected with 
the system (e.g. nodding, yawning, turning), of which only 
“bowing the head down” was used at this stage, encoded 
within the “percentage of time detected”. The problem that 
we perceive is that the noise of the measurements was evi- 
dent in the current setup, and that relying on the features 
which are more sensitive will depend on further improve- 
ments in the computer vision algorithms. 

Our current conclusion is that the technology shows promise 
and that future investigations will bring higher accuracy and 
new tools to the classrooms. Our future work will try to 
work in parallel on finding more meaningful measures, and 
coordinate with the teachers to determine the best way to 
present the found information back to the teaching process. 
Figure 3: Processing of detections. a) Overlaps between subjects’ areas. Each graph edge shows neighbouring
students’ areas and the potential for mis-assignment of detections. b) All detections over the duration of the
class, coloured according to the cluster to which they were assigned.


Feature name                Description                                                              Valid samples
Period                      Period of the class (1-4), associated with the attention                 776
Distance                    Distance from the teacher on a Cartesian plane of the classroom          776
Row                         Student’s row in the classroom                                           776
Detection percentage        Percentage of the recorded time that the student was detected            668
Head travel                 Accumulated changes (deltas) of the head horizontal rotations over time  496
Head travel (norm.)         Head travel normalized over the measurements of the specific person      482
                            in the class
Number of still periods     Number of periods (of minimal duration of 5 seconds) during which the    668
                            head movement can be considered still
Mean still period duration  Mean duration of the still period (as defined in the previous row)       618
Still time percentage       Percentage of time within the attention period during which the head     668
                            was still
Attention                   Reported level of attention (1-10)                                       715
Attention labelled          Attention reports mapped to categories low, medium, high                 715

Table 2: Features used in the analysis. 


Kernel                  Features                                             Score   Cohen’s kappa
RBF(c=1.31, g=0.0211)   Distance, Head travel norm., Num. still periods      61.86%  0.30
RBF(c=1.21, g=0.11)     Period, Row, Head travel norm., Mean duration still  61.72%  0.32
RBF(c=1.11, g=0.061)    Head travel norm., Mean duration still               60.42%  0.28
RBF(c=1.4, g=0.04)      Period, Distance, Row, Mean duration still           59.23%  0.30

Table 3: Classifier scores for predicting “attention labelled”. The scores given represent the prediction score on
the 20% test sample. Parameters of the kernels are abbreviated as c (penalty for the error term) and g (gamma).
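For readers who want to reproduce the two columns of Table 3 from their own predictions, the sketch below computes the percentage of correct classifications and Cohen's kappa, using the standard definition kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name is ours.

```python
from collections import Counter


def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's kappa) for two equal-length label lists."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal label frequencies.
    expected = sum(true_counts[c] * pred_counts[c]
                   for c in true_counts) / (n * n)
    if expected == 1.0:
        return observed, 0.0
    return observed, (observed - expected) / (1.0 - expected)
```

Kappa is useful here because the classes are unbalanced: with the class sizes given in Section 4.2, always predicting “medium” would already score about 44% correct while having a kappa of 0.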